As currently understood, “data science” encompasses a wide range of activities that involve uncovering insights from quantitative information.
Data scientists typically combine specific interests (“domain knowledge”, e.g., biology) with computation, mathematics, and statistics and probability to contribute to knowledge in their communities.
Skills combine in different proportions – no singular background among practitioners.
Diverse communities – science, industry, government, medicine, academia, etc.
Data science lifecycle
There is an emerging consensus that doing data science involves proceeding through a lifecycle: a repeated sequence of steps.
Less consensus at the moment about how many steps and what they are (google ‘data science lifecycle’ and check out all the flowcharts).
Most versions of the ‘data science lifecycle’ involve a few categories of steps:
Project planning
Data collection and organization
Exploration
Analysis
Communication and interpretation
(Perhaps it is this idea of a lifecycle that characterizes data science as distinct from other quantitative fields.)
Case studies: a preview
Case study 1: ACE and health
Association between adverse childhood experiences and general health, by sex.
You will:
process and recode 10K survey responses from the CDC’s 2019 Behavioral Risk Factor Surveillance System (BRFSS) survey
cross-tabulate health-related measurements with frequency of adverse childhood experiences
Case study 2: SEDA
Education achievement gaps as functions of socioeconomic indicators, by gender.
You will:
merge test scores and socioeconomic indicators from the 2018 Stanford Education Data Archive by school district
visually assess correlations between gender achievement gaps among grade schoolers and socioeconomic indicators across school districts in CA
Case study 3: Paleoclimatology
Sea surface temperature reconstruction over the past 16,000 years.
Clustering of diatom relative abundances in Pleistocene (pre-11KyBP) vs. Holocene (post-11KyBP) epochs.
You will:
explore ecological community structure from relative abundances of diatoms measured in ocean sediment core samples spanning ~15,000 years
use dimension reduction techniques to obtain measures of community structure
identify shifts associated with the transition from the Pleistocene to the Holocene epoch
Case study 4: Discrimination at DDS?
Apparent disparity in allocation of DDS benefits across racial groups.
Expenditure is strongly associated with age.
Correcting for age shows comparable expenditure across racial groups.
You will:
assess the case for discrimination in allocation of DDS benefits
identify confounding factors present in the sample
model median expenditure by racial group after correcting for age
About the course
Scope
This course is about developing your data science toolkit with foundational skills:
Core competency with R data science libraries
Critical thinking about data
Visualization and exploratory analysis
Application of statistical concepts and methods in practice
Communication and interpretation of results
Ethical data science
What’s unique about PSTAT 100?
There are a few distinctive aspects:
multiple end-to-end case studies
question-driven rather than method-driven
emphasis on project workflow
data storytelling and communication
Limitations
There are also some things we probably won’t cover:
Predictive modeling or machine learning (PSTAT 131)
Especially because we didn’t collect this data ourselves, we should do a little background research to understand where the data came from (Allison et al. 1976) and what limitations might exist:
Information about mammals only \(\longrightarrow\) no information about birds, fish, reptiles, etc.
Species weren’t chosen to represent mammalia \(\longrightarrow\) probably shouldn’t seek to generalize
Averages measured \(\longrightarrow\) ‘aggregated’ data (not individual-level)
So we can only explore the question narrowly for this particular group of animals using the data at hand – we don’t stand to learn anything generalizable.
Not a bad thing! We can still see what the data suggest and use results for hypothesis generation.
Step 3: Tidy
Clean up and organize
This dataset is already impeccably neat: each row is an observation for some mammal, and the columns are the species name and the two variables (average body weight and average brain weight).
So no tidying needed – we’ll just check the dimensions and see if any values are missing.
# dimensions?
dim(bb_weights)
[1] 62 3
# missing values?
colSums(is.na(bb_weights))
species body_wt brain_wt
0 0 0
Step 4: Explore
Look for patterns, structure, properties
Visualization is usually a good starting point.
# plot
ggplot(bb_weights, aes(x = body_wt, y = brain_wt)) +
  geom_point() +
  theme_bw(base_size = 16)
Plotting both variables on the log scale reveals a clearer, roughly linear pattern.
So what does that mean in terms of brain and body weights? A little algebra and we have:
\[(\text{brain}) \propto (\text{body})^\alpha\]
This is known as a “power-law” relationship: brain weight changes in proportion to a power of body weight.
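The algebra can be spelled out as follows. This is a sketch that assumes the trend on the log-log scale is roughly a straight line with slope \(\alpha\) and intercept \(c\) (the symbol \(c\) is introduced here only for illustration):

\[
\log(\text{brain}) = c + \alpha \log(\text{body})
\quad\Longrightarrow\quad
(\text{brain}) = e^{c}\,(\text{body})^{\alpha} \propto (\text{body})^{\alpha}
\]

Under this reading, the slope on the log scale is the exponent of the power law, and the intercept determines the proportionality constant \(e^{c}\).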
So it appears that for these 62 mammals, the brain-body scaling is well-described by a power law. (Notice: no generalization/extrapolation!)
Step 0: Hypothesize
We can now engage in question refinement. Do other classes of animal exhibit the same power law relationship? Is it the same or different from animal to animal?
To investigate, we need richer data.
Step 1: Collect
A number of authors have compiled and published ‘meta-analysis’ datasets by combining the results of multiple studies.
Below we’ll import a few of these for three different animal classes.
# Rename and combine datasets
rept <- reptiles %>%
  select(all_of(rept_vars)) %>%
  rename(body = `Body weight (g)`, brain = `Brain weight (g)`) %>%
  mutate(class = "Reptile")

bird <- birds %>%
  select(all_of(bird_vars)) %>%
  rename(body = `Body mass (g)`, brain = `Brain mass (g)`) %>%
  mutate(class = "Bird")

mamm <- mammals %>%
  select(all_of(mammal_vars)) %>%
  rename(body = `Body mass (g)`, brain = `Brain mass (g)`) %>%
  mutate(class = "Mammal")

# Combine datasets
data <- bind_rows(rept, mamm, bird)
Step 3: Tidy
In order to combine the datasets:
Select columns of interest;
Put in consistent order;
Give consistent names;
Concatenate.
I’ve suppressed the detail, but we can now inspect the result.
# missing values?
colMeans(is.na(data))
    Order    Family     Genus   Species       Sex      body     brain     class
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.5740405 0.0000000
Over half (about 57%) of the brain weight measurements in this dataset are missing.
Many of the studies combined to form these datasets did not include that particular measurement.
# Aggregate by species
avg_weights <- data %>%
  drop_na() %>%
  group_by(class, Order, Species, Sex) %>%
  summarise(across(c(body, brain), mean)) %>%
  ungroup() %>%
  mutate(log_body = log(body), log_brain = log(brain))
Step 4: Explore
Looking at a similar plot and overlaying trend lines, we see the same power law relationship but with different proportionality constants for the three classes of animal.
# Create final plot with regression lines
ggplot(avg_weights, aes(x = log_body, y = log_brain, color = class)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
Step 5: Analyze
So in this case there are three different linear relationships on the log scale that depend on animal class:
It seems that the brain and body weights of the birds, mammals, and reptiles measured in these studies exhibit distinct power law relationships.
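One way to write the three log-scale relationships is given below. The notation is an assumption: \(\alpha_i\) denotes the exponent for class \(i\), matching the role of \(\alpha\) in the power law earlier, and \(\beta_i\) denotes the log-scale intercept. The slides may assign these symbols differently.

\[
\log(\text{brain}) = \beta_i + \alpha_i \log(\text{body}),
\qquad i \in \{\text{bird}, \text{mammal}, \text{reptile}\}
\]

Equivalently, \((\text{brain}) = e^{\beta_i}\,(\text{body})^{\alpha_i}\), so each class has its own proportionality constant and, possibly, its own exponent.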
What would you investigate next?
Explore further?
Seek data on additional animal classes
Seek data on correlates of body weight
Seek data on other variables (lifespan, habitat, predation, etc.)
Inference and prediction?
Find better generalizable data
Estimate the \(\alpha_i\)’s and \(\beta_i\)’s
Find a way to predict brain weights for unobserved species
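As a rough sketch of what estimating the \(\alpha_i\)’s and \(\beta_i\)’s could look like, the code below fits a separate line for each class on the log scale. It uses simulated data, not the datasets from the lecture. The names `sim`, `alpha_true` and `beta_true` and all numeric values are made up for illustration, and it assumes \(\alpha_i\) is the log-scale slope (the exponent) and \(\beta_i\) the log-scale intercept.

```r
# Toy illustration with simulated data (NOT the lecture datasets)
set.seed(100)
n <- 50
sim <- data.frame(
  class = rep(c("Bird", "Mammal", "Reptile"), each = n),
  log_body = runif(3 * n, min = 0, max = 10)
)

# hypothetical "true" intercepts (beta) and exponents (alpha) per class
beta_true  <- c(Bird = -2.5, Mammal = -2.8, Reptile = -4.0)
alpha_true <- c(Bird = 0.60, Mammal = 0.70, Reptile = 0.50)
sim$log_brain <- beta_true[sim$class] +
  alpha_true[sim$class] * sim$log_body +
  rnorm(3 * n, sd = 0.3)

# one intercept and one slope per class (no shared baseline level)
fit <- lm(log_brain ~ 0 + class + class:log_body, data = sim)
est <- coef(fit)

# pull out the class-specific estimates by coefficient name
beta_hat  <- est[paste0("class", names(beta_true))]
alpha_hat <- est[paste0("class", names(alpha_true), ":log_body")]
```

On real data one would also want standard errors or confidence intervals (for example via `summary(fit)` or `confint(fit)`), and the caveats above about generalizing from non-representative samples still apply.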
Something else?
Lather, rinse, repeat
Hopefully you can see how we could go through multiple iterations of the cycle, continuing to refine the question and produce more detailed analyses each time, until we arrive at a fuller understanding of the subject under study.
A comment
Notice that I did not mention the word ‘model’ anywhere!
This was intentional – it is a common misconception that analyzing data always involves fitting models.
Models are not always necessary or appropriate
We learned a lot from plots alone
The Role of Models
A statistical model represents a set of assumptions about how the data were generated.
Models can inform statistical tests
Can be used to make predictions or forecasts and describe sources of variability.
Describe more complex aspects of the data that cannot be understood with simple visualizations and exploratory analysis alone
Scope for PSTAT 100
This term we’ll work on developing your data science toolkit with foundational skills:
Programming in R data science libraries
Critical thinking about sampling and generalizing from data
Visualization and exploratory analysis
Ethical data science
Statistical modeling
Throughout, we’ll explore applications of these tools to case studies.
DIKW Pyramid
Data is not information
To generate information from data we need:
Tools to generate, collect, or scrape data
Ability to clean and manipulate data to more usable forms
In this class we will use the tidyverse
Information is not knowledge
To generate knowledge from information we need:
Tools for exploratory data analysis
Clustering: identify attributes for grouping distinct subsets of data
Summarizing: compact representation of data (e.g. mean, variance, skew, etc)
Visualization (ggplot)
To generate knowledge we need:
Domain expertise and assumptions
Statistical and machine learning models
Ability to generalize about populations from a sample
Ability to quantify our uncertainty
Knowledge is not wisdom
Wisdom comes from knowledge when we practice:
Ethical decision making
What are the expected outcomes of each decision?
Might there be unintended consequences?
Who do my decisions help and/or hurt?
Respecting Privacy
What steps do I need to take to keep user data private and secure?
Wisdom and Data Science
Some questions we might ask ourselves throughout the quarter:
Is my data representative of the population?
Does my data have any biases? Measurement error?
Is the data “fair”? How does my analysis affect different groups?
Berkeley Gender Discrimination Example
| | Applicants (All) | Admitted (All) | Applicants (Men) |
|---|---|---|---|
| Total | 12,763 | 41% | 8,442 |
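| Department | Applicants (All) | Admitted (All) | Applicants (Men) | Admitted (Men) | Applicants (Women) | Admitted (Women) |
|---|---|---|---|---|---|---|
| A | 933 | 64% | 825 | 62% | 108 | 82% |
| B | 585 | 63% | 560 | 63% | 25 | 68% |
| C | 918 | 35% | 325 | 37% | 593 | 34% |
| D | 792 | 34% | 417 | 33% | 375 | 35% |
| E | 584 | 25% | 191 | 28% | 393 | 24% |
| F | 714 | 6% | 373 | 6% | 341 | 7% |
| Total | 4,526 | 39% | 2,691 | 45% | 1,835 | 30% |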
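To see how the department rows relate to the totals, here is a minimal sketch that recomputes the overall admission rates as applicant-weighted averages of the department rates. The data frame `berkeley` and its column names are made up for illustration. The rates are the rounded percentages from the table, so the results are approximate.

```r
# Department-level figures copied from the table (rates are rounded percentages)
berkeley <- data.frame(
  dept = c("A", "B", "C", "D", "E", "F"),
  men_applicants = c(825, 560, 325, 417, 191, 373),
  men_rate = c(0.62, 0.63, 0.37, 0.33, 0.28, 0.06),
  women_applicants = c(108, 25, 593, 375, 393, 341),
  women_rate = c(0.82, 0.68, 0.34, 0.35, 0.24, 0.07)
)

# overall rate = average of department rates, weighted by number of applicants
men_overall <- weighted.mean(berkeley$men_rate, berkeley$men_applicants)
women_overall <- weighted.mean(berkeley$women_rate, berkeley$women_applicants)
round(100 * c(men = men_overall, women = women_overall))

# share of each group's applications that went to each department
berkeley$men_share <- berkeley$men_applicants / sum(berkeley$men_applicants)
berkeley$women_share <- berkeley$women_applicants / sum(berkeley$women_applicants)
berkeley[, c("dept", "men_share", "women_share")]
```

Comparing the two share columns with the department admission rates is one way to start thinking about why the totals and the department-level rows can point in different directions.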
What story does the data tell?
Cancer Deaths in the US
Do you think cancer deaths (per 100,000 people) have risen or fallen since 1980?